CollateX and XML, Part 1

David J. Birnbaum (djbpitt@gmail.com, http://www.obdurodon.org), 2015-06-29

This is the first part of a multi-part tutorial on processing XML with CollateX (http://collatex.net). This example collates a single line of XML from four witnesses. It spells out each step in more detail than a real project would, which makes it easy to see how each step moves toward the final result. The output is in the three formats supported natively by CollateX: a plain-text alignment table, JSON, and colored HTML.

Still to come:

Part 2: Restructuring the code to use Python classes
Part 3: Reading multiline input from files
Part 4: Creating output in generic XML, suitable for transformation into TEI or other XML formats
Part 5: Fine-tuning the input to improve tokenization, normalization, and alignment
Part 6: Quicker processing with Python multiprocessing

Not planned: Post-processing of generic XML output, which is best done separately with XSLT 2.0.

Load libraries



In [1]:

    
from collatex import *
from lxml import etree
import json,re

Create XSLT stylesheets and functions to use them



In [2]:

    
addWMilestones = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="no" encoding="UTF-8" omit-xml-declaration="yes"/>
    <xsl:template match="*|@*">
        <xsl:copy>
            <xsl:apply-templates select="node() | @*"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="/*">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <!-- insert a <w/> milestone before the first word -->
            <w/>
            <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>
    <!-- convert <add>, <sic>, and <crease> to milestones (and leave them that way)
         CUSTOMIZE HERE: add other elements that may span multiple word tokens
    -->
    <xsl:template match="add | sic | crease ">
        <xsl:element name="{name()}">
            <xsl:attribute name="n">start</xsl:attribute>
        </xsl:element>
        <xsl:apply-templates/>
        <xsl:element name="{name()}">
            <xsl:attribute name="n">end</xsl:attribute>
        </xsl:element>
    </xsl:template>
    <xsl:template match="note"/>
    <xsl:template match="text()">
        <xsl:call-template name="whiteSpace">
            <xsl:with-param name="input" select="translate(.,'&#x0a;',' ')"/>
        </xsl:call-template>
    </xsl:template>
    <xsl:template name="whiteSpace">
        <xsl:param name="input"/>
        <xsl:choose>
            <xsl:when test="not(contains($input, ' '))">
                <xsl:value-of select="$input"/>
            </xsl:when>
            <xsl:when test="starts-with($input,' ')">
                <xsl:call-template name="whiteSpace">
                    <xsl:with-param name="input" select="substring($input,2)"/>
                </xsl:call-template>
            </xsl:when>
            <xsl:otherwise>
                <xsl:value-of select="substring-before($input, ' ')"/>
                <w/>
                <xsl:call-template name="whiteSpace">
                    <xsl:with-param name="input" select="substring-after($input,' ')"/>
                </xsl:call-template>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>
</xsl:stylesheet>

""")
transformAddW = etree.XSLT(addWMilestones)
                           
xsltWrapW = etree.XML('''
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="xml" indent="no" omit-xml-declaration="yes"/>
    <xsl:template match="/*">
        <xsl:copy>
            <xsl:apply-templates select="w"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="w">
        <!-- faking <xsl:for-each-group> as well as the "<<" and "except" operators -->
        <xsl:variable name="tooFar" select="following-sibling::w[1] | following-sibling::w[1]/following::node()"/>
        <w>
            <xsl:copy-of select="following-sibling::node()[count(. | $tooFar) != count($tooFar)]"/>
        </w>
    </xsl:template>
</xsl:stylesheet>
''')
transformWrapW = etree.XSLT(xsltWrapW)
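The second stylesheet relies on an XPath 1.0 idiom that is worth spelling out: a node belongs to a node-set exactly when adding it to the set does not change the set's size, so count(. | $tooFar) != count($tooFar) keeps only the nodes that are not in $tooFar. The following is a minimal plain-Python sketch of that grouping logic, not part of the notebook's pipeline. The flat list of strings, the "W" marker standing in for an empty <w/> milestone, and the function name group_by_milestone are all assumptions made for illustration.

```python
# Flat sequence of sibling nodes, roughly as the first stylesheet leaves them.
# "W" stands in for an empty <w/> milestone (an assumption for this sketch).
nodes = ["W", "<abbrev>Et</abbrev>", "cil", "W", "i", "W", "partent", "W", "seulement"]

def group_by_milestone(nodes, marker="W"):
    """Collect, for each milestone, the siblings that precede the next milestone."""
    groups = []
    for i, node in enumerate(nodes):
        if node != marker:
            continue
        # Positions of all following siblings of this milestone
        following = list(range(i + 1, len(nodes)))
        # The next milestone and everything after it is "too far"
        next_marker = next((j for j in following if nodes[j] == marker), None)
        too_far = set() if next_marker is None else set(range(next_marker, len(nodes)))
        # XPath 1.0 set difference: keep j when adding it to too_far grows the set
        keep = [j for j in following if len({j} | too_far) != len(too_far)]
        groups.append("".join(nodes[j] for j in keep))
    return groups

print(group_by_milestone(nodes))
# ['<abbrev>Et</abbrev>cil', 'i', 'partent', 'seulement']
```

XSLT 2.0 expresses the same thing directly with <xsl:for-each-group> or the except operator; the size comparison is only needed because lxml supports XSLT 1.0.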

Create and examine XML data



In [3]:

    
A = """<l><abbrev>Et</abbrev>cil i partent seulement</l>"""
B = """<l><abbrev>Et</abbrev>cil i p<abbrev>er</abbrev>dent ausem<abbrev>en</abbrev>t</l>"""
C = """<l><abbrev>Et</abbrev>cil i p<abbrev>ar</abbrev>tent seulema<abbrev>n</abbrev>t</l>"""
D = """<l>E cil i partent sulement</l>"""

ATree = etree.XML(A)
BTree = etree.XML(B)
CTree = etree.XML(C)
DTree = etree.XML(D)

print(A)
print(ATree)

<l><abbrev>Et</abbrev>cil i partent seulement</l>
<Element l at 0x10daa5988>

Tokenize XML input by adding <w> tags and examine the results



In [4]:

    
ATokenized = transformWrapW(transformAddW(ATree))
BTokenized = transformWrapW(transformAddW(BTree))
CTokenized = transformWrapW(transformAddW(CTree))
DTokenized = transformWrapW(transformAddW(DTree))

print(ATokenized)

<l><w><abbrev>Et</abbrev>cil</w><w>i</w><w>partent</w><w>seulement</w></l>
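As a cross-check on the result above, the same token boundaries can be recovered for this particular sample with nothing but string splitting. This is a sketch under strong assumptions, not a replacement for the stylesheets: it assumes the line is wrapped in a bare <l> element, that no tag contains whitespace (so no attributes), and that <add>, <sic>, <crease>, and <note> do not occur. The helper name naive_tokens is hypothetical.

```python
import re

def naive_tokens(line_xml):
    """Split the content of a bare <l>...</l> string on whitespace."""
    # Drop the outer <l> wrapper; raises AttributeError if the wrapper is missing
    inner = re.fullmatch(r'<l>(.*)</l>', line_xml, flags=re.DOTALL).group(1)
    # str.split() with no argument splits on runs of whitespace and drops empties
    return inner.split()

A = "<l><abbrev>Et</abbrev>cil i partent seulement</l>"
D = "<l>E cil i partent sulement</l>"
print(naive_tokens(A))
print(naive_tokens(D))
```

Witness A yields four tokens and witness D five, because A has no whitespace between the abbreviated Et and cil. That difference in token count carries through to the collation further on.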

Function to convert the word-tokenized witness line into JSON



In [5]:

    
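def XMLtoJSON(witness_id, XMLInput):
    """Build a CollateX witness dictionary from a word-tokenized line."""
    # Capture everything between the outer <w> and </w> of a serialized token
    unwrapRegex = re.compile(r'<w>(.*)</w>')
    # Match any single tag; used to strip markup from the normalized form
    stripTagsRegex = re.compile(r'<.*?>')
    words = XMLInput.xpath('//w')
    witness = {'id': witness_id, 'tokens': []}
    for word in words:
        # Serialize the <w> element, then drop its own start and end tags
        unwrapped = unwrapRegex.match(etree.tostring(word, encoding='unicode')).group(1)
        token = {
            't': unwrapped,  # token text as it appears, inline markup included
            'n': stripTagsRegex.sub('', unwrapped.lower()),  # lowercased, markup removed
        }
        witness['tokens'].append(token)
    return witness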
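In CollateX's pretokenized JSON input, each token carries a 't' property (the text to report) and an 'n' property (the normalized form used for comparison). The normalization step in the function above can be tried in isolation with the standard library alone. This is a small sketch; the helper name make_token is hypothetical, and the sample strings are taken from witnesses B and D.

```python
import re

# Non-greedy match, so each tag is removed separately
strip_tags = re.compile(r'<.*?>')

def make_token(raw):
    """Return a token dict: 't' keeps inline markup, 'n' is lowercased text only."""
    return {'t': raw, 'n': strip_tags.sub('', raw.lower())}

print(make_token('p<abbrev>er</abbrev>dent'))
print(make_token('E'))
```

Lowercasing before stripping also lowercases the tag names, which is harmless here because the tags are discarded from 'n' anyway.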

Use the function to create JSON input for CollateX, and examine it



In [6]:

    
json_input = {}
json_input['witnesses'] = []
json_input['witnesses'].append(XMLtoJSON('A',ATokenized))
json_input['witnesses'].append(XMLtoJSON('B',BTokenized))
json_input['witnesses'].append(XMLtoJSON('C',CTokenized))
json_input['witnesses'].append(XMLtoJSON('D',DTokenized))
print(json_input)

{'witnesses': [{'tokens': [{'n': 'etcil', 't': '<abbrev>Et</abbrev>cil'}, {'n': 'i', 't': 'i'}, {'n': 'partent', 't': 'partent'}, {'n': 'seulement', 't': 'seulement'}], 'id': 'A'}, {'tokens': [{'n': 'etcil', 't': '<abbrev>Et</abbrev>cil'}, {'n': 'i', 't': 'i'}, {'n': 'perdent', 't': 'p<abbrev>er</abbrev>dent'}, {'n': 'ausement', 't': 'ausem<abbrev>en</abbrev>t'}], 'id': 'B'}, {'tokens': [{'n': 'etcil', 't': '<abbrev>Et</abbrev>cil'}, {'n': 'i', 't': 'i'}, {'n': 'partent', 't': 'p<abbrev>ar</abbrev>tent'}, {'n': 'seulemant', 't': 'seulema<abbrev>n</abbrev>t'}], 'id': 'C'}, {'tokens': [{'n': 'e', 't': 'E'}, {'n': 'cil', 't': 'cil'}, {'n': 'i', 't': 'i'}, {'n': 'partent', 't': 'partent'}, {'n': 'sulement', 't': 'sulement'}], 'id': 'D'}]}
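The single-line dictionary above is hard to read. The json module is already imported, and json.dumps with its indent parameter gives a more legible view. A small sketch, using only the first two tokens of witness D as sample data:

```python
import json

# Sample data: the first two tokens of witness D from the output above
sample = {'witnesses': [{'id': 'D', 'tokens': [{'t': 'E', 'n': 'e'},
                                               {'t': 'cil', 'n': 'cil'}]}]}

# indent=2 puts each key on its own line;
# ensure_ascii=False keeps any non-ASCII characters readable
pretty = json.dumps(sample, indent=2, ensure_ascii=False)
print(pretty)
```

In the notebook itself the same call would presumably be print(json.dumps(json_input, indent=2)).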

Collate the witnesses and view the output in a table, as JSON, and as colored HTML



In [7]:

    
collationText = collate_pretokenized_json(json_input,output='table',layout='vertical')
print(collationText)
collationJSON = collate_pretokenized_json(json_input,output='json')
print(collationJSON)
collationHTML2 = collate_pretokenized_json(json_input,output='html2')

+----------------------+----------------------+----------------------+----------+
|          A           |          B           |          C           |    D     |
+----------------------+----------------------+----------------------+----------+
| <abbrev>Et</abbrev>c | <abbrev>Et</abbrev>c | <abbrev>Et</abbrev>c |    E     |
|          il          |          il          |          il          |          |
+----------------------+----------------------+----------------------+----------+
|          -           |          -           |          -           |   cil    |
+----------------------+----------------------+----------------------+----------+
|          i           |          i           |          i           |    i     |
+----------------------+----------------------+----------------------+----------+
|       partent        | p<abbrev>er</abbrev> | p<abbrev>ar</abbrev> | partent  |
|                      |         dent         |         tent         |          |
+----------------------+----------------------+----------------------+----------+
|      seulement       | ausem<abbrev>en</abb | seulema<abbrev>n</ab | sulement |
|                      |        rev>t         |        brev>t        |          |
+----------------------+----------------------+----------------------+----------+
{"table": [[[{"n": "etcil", "t": "<abbrev>Et</abbrev>cil"}], [null], [{"n": "i", "t": "i"}], [{"n": "partent", "t": "partent"}], [{"n": "seulement", "t": "seulement"}]], [[{"n": "etcil", "t": "<abbrev>Et</abbrev>cil"}], [null], [{"n": "i", "t": "i"}], [{"n": "perdent", "t": "p<abbrev>er</abbrev>dent"}], [{"n": "ausement", "t": "ausem<abbrev>en</abbrev>t"}]], [[{"n": "etcil", "t": "<abbrev>Et</abbrev>cil"}], [null], [{"n": "i", "t": "i"}], [{"n": "partent", "t": "p<abbrev>ar</abbrev>tent"}], [{"n": "seulemant", "t": "seulema<abbrev>n</abbrev>t"}]], [[{"n": "e", "t": "E"}], [{"n": "cil", "t": "cil"}], [{"n": "i", "t": "i"}], [{"n": "partent", "t": "partent"}], [{"n": "sulement", "t": "sulement"}]]], "witnesses": ["A", "B", "C", "D"]}

A          B         C          D
etcil      etcil     etcil      e
-          -         -          cil
i          i         i          i
partent    perdent   partent    partent
seulement  ausement  seulemant  sulement
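
The JSON output is organized by witness: 'table' holds one list per witness, each containing one cell per alignment column, and a cell is either a list of token objects or [null] where the witness has a gap. To read it column by column, transpose it. The following sketch works on an excerpt of the JSON output above, restricted to witnesses A and D, and assumes the output is a JSON string, as the print call suggests. The excerpt variable and the function name rows_of_n are assumptions for illustration.

```python
import json

# Excerpt of the JSON output above, restricted to witnesses A and D
excerpt = '''{"table": [
  [[{"n": "etcil", "t": "<abbrev>Et</abbrev>cil"}], [null], [{"n": "i", "t": "i"}],
   [{"n": "partent", "t": "partent"}], [{"n": "seulement", "t": "seulement"}]],
  [[{"n": "e", "t": "E"}], [{"n": "cil", "t": "cil"}], [{"n": "i", "t": "i"}],
   [{"n": "partent", "t": "partent"}], [{"n": "sulement", "t": "sulement"}]]
], "witnesses": ["A", "D"]}'''

def rows_of_n(collation_json):
    """Return the witness ids and one row of 'n' values per alignment column."""
    data = json.loads(collation_json)
    rows = []
    # zip(*table) turns the per-witness lists into per-column tuples
    for column in zip(*data['table']):
        row = []
        for cell in column:
            tokens = [tok for tok in cell if tok is not None]
            # A cell of [null] marks a gap; show it as '-' as the text table does
            row.append(' '.join(tok['n'] for tok in tokens) if tokens else '-')
        rows.append(row)
    return data['witnesses'], rows

witnesses, rows = rows_of_n(excerpt)
print(witnesses)
for row in rows:
    print(row)
```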

Here’s what happens when the raw XML strings are collated as plain text, so that the markup is tokenized along with the words:



In [8]:

    
collation = Collation()
collation.add_plain_witness('A',A)
collation.add_plain_witness('B',B)
collation.add_plain_witness('C',C)
collation.add_plain_witness('D',D)
print(collate(collation,output='table',layout='vertical'))

+--------------+--------------+--------------+----------+
|      A       |      B       |      C       |    D     |
+--------------+--------------+--------------+----------+
|     < l      |     < l      |     < l      |   < l    |
+--------------+--------------+--------------+----------+
|  >< abbrev   |  >< abbrev   |  >< abbrev   |    -     |
+--------------+--------------+--------------+----------+
|      >       |      >       |      >       |    >     |
+--------------+--------------+--------------+----------+
| Et</ abbrev> | Et</ abbrev> | Et</ abbrev> |    E     |
+--------------+--------------+--------------+----------+
|    cil i     |    cil i     |    cil i     |  cil i   |
+--------------+--------------+--------------+----------+
|   partent    |      p<      |      p<      | partent  |
+--------------+--------------+--------------+----------+
|  seulement   |    abbrev    |    abbrev    | sulement |
+--------------+--------------+--------------+----------+
|      -       |      >       |      >       |    -     |
+--------------+--------------+--------------+----------+
|      -       |      er      |      ar      |    -     |
+--------------+--------------+--------------+----------+
|      -       |  </ abbrev>  |  </ abbrev>  |    -     |
+--------------+--------------+--------------+----------+
|      -       |  dent ausem  | tent seulema |    -     |
+--------------+--------------+--------------+----------+
|      -       |  < abbrev>   |  < abbrev>   |    -     |
+--------------+--------------+--------------+----------+
|      -       |      en      |      n       |    -     |
+--------------+--------------+--------------+----------+
|      -       | </ abbrev> t | </ abbrev> t |    -     |
+--------------+--------------+--------------+----------+
|    </ l>     |    </ l>     |    </ l>     |  </ l>   |
+--------------+--------------+--------------+----------+
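
The fragmentation above follows from treating the markup as ordinary text: a tokenizer that splits on word and non-word characters will cut tags into pieces. The sketch below approximates that behavior with a regular expression. The pattern is an assumption chosen to illustrate the effect and is not claimed to be CollateX's actual tokenizer; the function name word_punct_tokens is hypothetical.

```python
import re

def word_punct_tokens(text):
    """Split into runs of word characters (plus trailing whitespace) or non-word characters."""
    return re.findall(r'\w+\s*|\W+', text)

A = "<l><abbrev>Et</abbrev>cil i partent seulement</l>"
tokens = word_punct_tokens(A)
print(tokens)
```

Angle brackets, slashes, and element names all come out as separate tokens and compete with the actual words for alignment, which is why the XML-aware tokenization earlier in this notebook is worth the effort.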
